# Lab 20 - Correlation, causation, and heat maps

The Federal Reserve Bank of New York has information about the labor market for recent college graduates [here](https://www.newyorkfed.org/research/college-labor-market/college-labor-market_compare-majors.html).

The data in this table can be downloaded as an Excel file at the bottom of the page. If you open this file in Excel, you can save the last table as a CSV file. Alternatively, download the data as a CSV file from the course website. 

We will be using a new data visualization library called [Seaborn](https://seaborn.pydata.org/#). Install it with the code below (this might take a few minutes):

In [None]:
!pip install --user seaborn

Import Seaborn and the other libraries so we can use them in our code.

In [32]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline

### Loading and cleaning the data

Before creating the dataframe, look at the CSV file. Do you notice anything that you will have to account for when reading in the data? 

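Hint: To skip the 5 rows at the end of the CSV file, use the parameter `skipfooter = 5`. Pandas will warn that it is falling back to its slower Python parser. You can ignore the warning, or add `engine = "python"` to silence it.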
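As a rough illustration of what `skipfooter` does, the sketch below reads a small made-up CSV held in a string (not the real labor market file) whose last two lines are notes rather than data. The column names and values are invented for the example, and `engine="python"` is passed because `skipfooter` is only supported by Pandas' Python parser.

```python
import io
import pandas as pd

# Made-up CSV text for illustration only; the last two lines are notes, not data.
toy_lines = [
    "Major,Unemployment Rate,Median Wage Early Career",
    'Toy Major A,3.5,"40,000"',
    'Toy Major B,4.1,"52,000"',
    'Toy Major C,2.9,"61,000"',
    "Note: these numbers are invented",
    "Source: none",
]
toy_csv = "\n".join(toy_lines)

# skipfooter=2 drops the two note lines before the rows are parsed.
toy = pd.read_csv(io.StringIO(toy_csv), skipfooter=2, engine="python")
print(toy)
```

For the real file, the number of rows to skip is the one given in the hint, and the argument to `read_csv` is the file name instead of an `io.StringIO` object.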

Load the labor market CSV file into a dataframe called `labor`:

Display your `labor` dataframe below to check it.

We will want to use all the columns as numbers, but has `read_csv` interpreted them as numbers? Type `labor.dtypes` below and run the code to see what *type* each column is. `dtypes` is a property of the dataframe, not a function, so it doesn't have `()` at the end. 

Decimal numbers are stored as `float64`, so the `Unemployment Rate`, `Underemployment Rate`, and `Share with Graduate Degree` columns have been read in as numbers. The `Median Wage Early Career` and `Median Wage Mid-Career` columns have type `object`, which probably means a string. These columns were read in as strings because they contain commas.
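To see how a comma changes the type Pandas infers, here is a small sketch on an invented dataframe (`toy_types`, `rate`, and `wage` are hypothetical names, not part of the lab data). It prints `dtypes` and then uses `pd.api.types.is_numeric_dtype` to test each column.

```python
import pandas as pd

# Invented values: "rate" holds decimals, "wage" holds text with commas.
toy_types = pd.DataFrame({
    "rate": [3.5, 4.1, 2.9],
    "wage": ["40,000", "52,000", "61,000"],
})

# dtypes is an attribute, so there are no parentheses.
print(toy_types.dtypes)

# Check each column: True means Pandas treats it as a number.
for name in toy_types.columns:
    print(name, pd.api.types.is_numeric_dtype(toy_types[name]))
```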

The following code will remove the commas and tell Pandas that these columns are `float`s as well.

In [None]:
labor["Median Wage Early Career"] = labor["Median Wage Early Career"].str.replace(",","").astype(float)
labor["Median Wage Mid-Career"] = labor["Median Wage Mid-Career"].str.replace(",","").astype(float)
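The same pattern on a tiny invented Series (`toy_wage` is a hypothetical name) shows each step separately: `.str.replace` edits the text, and `.astype(float)` converts the cleaned text to numbers.

```python
import pandas as pd

# Invented wage strings, formatted with commas like the real columns.
toy_wage = pd.Series(["40,000", "52,000", "61,000"])

# Step 1: remove the commas; the values are still text.
no_commas = toy_wage.str.replace(",", "")

# Step 2: convert the text to floating-point numbers.
as_numbers = no_commas.astype(float)

# Arithmetic works once the values are numbers.
print(as_numbers.mean())
```

An alternative, if you prefer to handle this while loading, is the `thousands = ","` parameter of `read_csv`, which asks Pandas to treat commas as thousands separators.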

Check that this code worked by displaying `labor` again:

Show the type for each column:

### Correlation matrix

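We can compute the correlation matrix with the code `labor.corr()`. Try it below. If your version of Pandas raises an error about converting a string to a float, use `labor.corr(numeric_only = True)` so that non-numeric columns are skipped.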

Which pair of columns is the most correlated? Which pair of columns is the least correlated?
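One possible way to check your answer in code, sketched on an invented dataframe rather than on `labor` (the names `toy`, `a`, `b`, `c`, and `label` are hypothetical). It assumes that "most correlated" means the largest absolute correlation, so a strong negative relationship would also count.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.1, 3.9, 6.2, 8.1, 9.8],      # rises with a
    "c": [5.0, 3.0, 4.0, 1.0, 2.0],      # tends to fall as a rises
    "label": ["v", "w", "x", "y", "z"],  # text column, excluded below
})

# Correlation matrix of the numeric columns only.
corr_toy = toy.select_dtypes(include="number").corr()

# Keep only the entries above the diagonal so each pair appears once
# and the 1.0 values on the diagonal are ignored.
above_diagonal = np.triu(np.ones(corr_toy.shape, dtype=bool), k=1)
pairs = corr_toy.where(above_diagonal).unstack().dropna()

# The pair whose correlation is largest in absolute value.
strongest_pair = pairs.abs().idxmax()
print(pairs)
print("Strongest:", strongest_pair)
```

Replacing `idxmax` with `idxmin` gives the pair whose correlation is closest to zero.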

### Heat maps

We can visualize the correlation matrix with a heat map. To do this, we need to save the correlation matrix in a variable (say `corr_matrix`) and then run the code `sns.heatmap(corr_matrix)`. Try it below.

Find the pairs of columns that you thought were most correlated and least correlated above. Do the colors for these pairs make sense?
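If the colors are hard to interpret, a few optional arguments to `sns.heatmap` can help. The sketch below uses an invented dataframe (`toy` and its columns are hypothetical) and fixes the color scale to the full correlation range of -1 to 1, picks a two-color palette, and writes each value in its cell. These choices are one reasonable option, not a requirement of the lab.

```python
import pandas as pd
import seaborn as sns

# Invented numeric data, for illustration only.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.1, 3.9, 6.2, 8.1, 9.8],
    "c": [5.0, 3.0, 4.0, 1.0, 2.0],
})
corr_toy = toy.corr()

# vmin/vmax pin the color scale to [-1, 1] so 0 sits at the midpoint,
# cmap="coolwarm" shows negative values in blue and positive in red,
# annot=True prints each correlation inside its cell.
ax = sns.heatmap(corr_toy, vmin=-1, vmax=1, cmap="coolwarm", annot=True)
ax.set_title("Toy correlation matrix")
```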

Make a scatter plot of the two columns that are most correlated. Remember the code pattern for creating a scatter plot is `df.plot.scatter(x = "column name 1", y = "column name 2")`.

Does the relationship look linear (like a line)?

Now make a scatter plot of the two columns that were least correlated.

Does this relationship look linear?

We can plot all possible scatter plots at once with the Seaborn command `sns.pairplot(labor)`. Try it below.

Can you spot the pairs of columns with negative correlation from the plots? What plots are on the diagonal?

## Another heat map: Traffic counts

To fully see the power of heat maps, we need a larger data set. Download the dataset of traffic counts [here](https://data.cityofnewyork.us/Transportation/Traffic-Volume-Counts-2012-2013-/p424-amsu).

This dataset contains counts of the number of vehicles to pass different sections of road at different times of the day.

Load the CSV file into a variable called `traffic`.

Display your new dataframe.

What are the columns?

Compute the correlation matrix.

Our correlation matrix is fairly large, so it is hard to see patterns in it. To visualize the correlation matrix, display it as a heat map.

What do you notice about the heat map? Which columns are similar? Does this make sense?
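To illustrate why a heat map helps with a wide table, the sketch below builds purely synthetic data (random numbers, not the traffic counts) with 24 hypothetical columns named `hour_0` to `hour_23`. Each column is a running total of random noise, so neighboring columns end up more alike than distant ones, and that structure appears as a bright band along the diagonal.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data only: 500 rows of random noise in 24 columns.
rng = np.random.default_rng(0)
noise = rng.normal(size=(500, 24))

# A running total across columns makes neighboring columns similar.
running = noise.cumsum(axis=1)
synthetic = pd.DataFrame(running, columns=[f"hour_{h}" for h in range(24)])

# A 24 x 24 table of numbers is hard to scan by eye...
corr_synth = synthetic.corr()
print(corr_synth.shape)

# ...but the pattern is visible at a glance in a heat map.
ax = sns.heatmap(corr_synth, vmin=-1, vmax=1, cmap="coolwarm")
```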

### Challenges:
- Which pair of numeric columns in the taxi dataset is the most correlated? Which pair is the least correlated? Do these results make sense?
- Which pair of numeric columns in the Starbucks drink nutrition information is the most correlated? Which pair is the least correlated? Do these results make sense?